This section will load all of the libraries that are required to conduct the data analysis of our dataset.
library(tidyverse)
library(dplyr)
library(ggplot2)
library(psych)
library(patchwork)
library(naniar)
library(cowplot)
library(caret)
library(patchwork)
library(forcats)
library(ggrepel)
library(naniar)
library(rpart)
library(caret)
library(caTools)
library(party)
library(magrittr)
library(e1071)
library(ROSE)
library(caret)
library(psych)
library(cowplot)
library(GGally)
knitr::opts_chunk$set(fig.width=20, fig.height=15)
url <- "https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/arabica_data_cleaned.csv"
coffee <- read.csv(url)
Using the URL the data set is loaded and stored as the variable name url.
str(coffee)
## 'data.frame': 1311 obs. of 44 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Species : chr "Arabica" "Arabica" "Arabica" "Arabica" ...
## $ Owner : chr "metad plc" "metad plc" "grounds for health admin" "yidnekachew dabessa" ...
## $ Country.of.Origin : chr "Ethiopia" "Ethiopia" "Guatemala" "Ethiopia" ...
## $ Farm.Name : chr "metad plc" "metad plc" "san marcos barrancas \"san cristobal cuch" "yidnekachew dabessa coffee plantation" ...
## $ Lot.Number : chr "" "" "" "" ...
## $ Mill : chr "metad plc" "metad plc" "" "wolensu" ...
## $ ICO.Number : chr "2014/2015" "2014/2015" "" "" ...
## $ Company : chr "metad agricultural developmet plc" "metad agricultural developmet plc" "" "yidnekachew debessa coffee plantation" ...
## $ Altitude : chr "1950-2200" "1950-2200" "1600 - 1800 m" "1800-2200" ...
## $ Region : chr "guji-hambela" "guji-hambela" "" "oromia" ...
## $ Producer : chr "METAD PLC" "METAD PLC" "" "Yidnekachew Dabessa Coffee Plantation" ...
## $ Number.of.Bags : int 300 300 5 320 300 100 100 300 300 50 ...
## $ Bag.Weight : chr "60 kg" "60 kg" "1" "60 kg" ...
## $ In.Country.Partner : chr "METAD Agricultural Development plc" "METAD Agricultural Development plc" "Specialty Coffee Association" "METAD Agricultural Development plc" ...
## $ Harvest.Year : chr "2014" "2014" "" "2014" ...
## $ Grading.Date : chr "April 4th, 2015" "April 4th, 2015" "May 31st, 2010" "March 26th, 2015" ...
## $ Owner.1 : chr "metad plc" "metad plc" "Grounds for Health Admin" "Yidnekachew Dabessa" ...
## $ Variety : chr "" "Other" "Bourbon" "" ...
## $ Processing.Method : chr "Washed / Wet" "Washed / Wet" "" "Natural / Dry" ...
## $ Aroma : num 8.67 8.75 8.42 8.17 8.25 8.58 8.42 8.25 8.67 8.08 ...
## $ Flavor : num 8.83 8.67 8.5 8.58 8.5 8.42 8.5 8.33 8.67 8.58 ...
## $ Aftertaste : num 8.67 8.5 8.42 8.42 8.25 8.42 8.33 8.5 8.58 8.5 ...
## $ Acidity : num 8.75 8.58 8.42 8.42 8.5 8.5 8.5 8.42 8.42 8.5 ...
## $ Body : num 8.5 8.42 8.33 8.5 8.42 8.25 8.25 8.33 8.33 7.67 ...
## $ Balance : num 8.42 8.42 8.42 8.25 8.33 8.33 8.25 8.5 8.42 8.42 ...
## $ Uniformity : num 10 10 10 10 10 10 10 10 9.33 10 ...
## $ Clean.Cup : num 10 10 10 10 10 10 10 10 10 10 ...
## $ Sweetness : num 10 10 10 10 10 10 10 9.33 9.33 10 ...
## $ Cupper.Points : num 8.75 8.58 9.25 8.67 8.58 8.33 8.5 9 8.67 8.5 ...
## $ Total.Cup.Points : num 90.6 89.9 89.8 89 88.8 ...
## $ Moisture : num 0.12 0.12 0 0.11 0.12 0.11 0.11 0.03 0.03 0.1 ...
## $ Category.One.Defects : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Quakers : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Color : chr "Green" "Green" "" "Green" ...
## $ Category.Two.Defects : int 0 1 0 2 2 1 0 0 0 4 ...
## $ Expiration : chr "April 3rd, 2016" "April 3rd, 2016" "May 31st, 2011" "March 25th, 2016" ...
## $ Certification.Body : chr "METAD Agricultural Development plc" "METAD Agricultural Development plc" "Specialty Coffee Association" "METAD Agricultural Development plc" ...
## $ Certification.Address: chr "309fcf77415a3661ae83e027f7e5f05dad786e44" "309fcf77415a3661ae83e027f7e5f05dad786e44" "36d0d00a3724338ba7937c52a378d085f2172daa" "309fcf77415a3661ae83e027f7e5f05dad786e44" ...
## $ Certification.Contact: chr "19fef5a731de2db57d16da10287413f5f99bc2dd" "19fef5a731de2db57d16da10287413f5f99bc2dd" "0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660" "19fef5a731de2db57d16da10287413f5f99bc2dd" ...
## $ unit_of_measurement : chr "m" "m" "m" "m" ...
## $ altitude_low_meters : num 1950 1950 1600 1800 1950 ...
## $ altitude_high_meters : num 2200 2200 1800 2200 2200 NA NA 1700 1700 1850 ...
## $ altitude_mean_meters : num 2075 2075 1700 2000 2075 ...
Our dataset consists of only Arabica coffee species. The features that are included in the data set are :
In the next following section, we will clean the data based on the various issues we had encountered during the preliminary exploratory data analysis. The following processes will be implemented in order to prepare the data set for the appropriate analysis:
glimpse(coffee)
## Rows: 1,311
## Columns: 44
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
## $ Species <chr> "Arabica", "Arabica", "Arabica", "Arabica", "Ara…
## $ Owner <chr> "metad plc", "metad plc", "grounds for health ad…
## $ Country.of.Origin <chr> "Ethiopia", "Ethiopia", "Guatemala", "Ethiopia",…
## $ Farm.Name <chr> "metad plc", "metad plc", "san marcos barrancas …
## $ Lot.Number <chr> "", "", "", "", "", "", "", "", "", "", "", "", …
## $ Mill <chr> "metad plc", "metad plc", "", "wolensu", "metad …
## $ ICO.Number <chr> "2014/2015", "2014/2015", "", "", "2014/2015", "…
## $ Company <chr> "metad agricultural developmet plc", "metad agri…
## $ Altitude <chr> "1950-2200", "1950-2200", "1600 - 1800 m", "1800…
## $ Region <chr> "guji-hambela", "guji-hambela", "", "oromia", "g…
## $ Producer <chr> "METAD PLC", "METAD PLC", "", "Yidnekachew Dabes…
## $ Number.of.Bags <int> 300, 300, 5, 320, 300, 100, 100, 300, 300, 50, 3…
## $ Bag.Weight <chr> "60 kg", "60 kg", "1", "60 kg", "60 kg", "30 kg"…
## $ In.Country.Partner <chr> "METAD Agricultural Development plc", "METAD Agr…
## $ Harvest.Year <chr> "2014", "2014", "", "2014", "2014", "2013", "201…
## $ Grading.Date <chr> "April 4th, 2015", "April 4th, 2015", "May 31st,…
## $ Owner.1 <chr> "metad plc", "metad plc", "Grounds for Health Ad…
## $ Variety <chr> "", "Other", "Bourbon", "", "Other", "", "Other"…
## $ Processing.Method <chr> "Washed / Wet", "Washed / Wet", "", "Natural / D…
## $ Aroma <dbl> 8.67, 8.75, 8.42, 8.17, 8.25, 8.58, 8.42, 8.25, …
## $ Flavor <dbl> 8.83, 8.67, 8.50, 8.58, 8.50, 8.42, 8.50, 8.33, …
## $ Aftertaste <dbl> 8.67, 8.50, 8.42, 8.42, 8.25, 8.42, 8.33, 8.50, …
## $ Acidity <dbl> 8.75, 8.58, 8.42, 8.42, 8.50, 8.50, 8.50, 8.42, …
## $ Body <dbl> 8.50, 8.42, 8.33, 8.50, 8.42, 8.25, 8.25, 8.33, …
## $ Balance <dbl> 8.42, 8.42, 8.42, 8.25, 8.33, 8.33, 8.25, 8.50, …
## $ Uniformity <dbl> 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, 10.00,…
## $ Clean.Cup <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, …
## $ Sweetness <dbl> 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, 10.00,…
## $ Cupper.Points <dbl> 8.75, 8.58, 9.25, 8.67, 8.58, 8.33, 8.50, 9.00, …
## $ Total.Cup.Points <dbl> 90.58, 89.92, 89.75, 89.00, 88.83, 88.83, 88.75,…
## $ Moisture <dbl> 0.12, 0.12, 0.00, 0.11, 0.12, 0.11, 0.11, 0.03, …
## $ Category.One.Defects <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Quakers <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Color <chr> "Green", "Green", "", "Green", "Green", "Bluish-…
## $ Category.Two.Defects <int> 0, 1, 0, 2, 2, 1, 0, 0, 0, 4, 1, 0, 0, 2, 2, 0, …
## $ Expiration <chr> "April 3rd, 2016", "April 3rd, 2016", "May 31st,…
## $ Certification.Body <chr> "METAD Agricultural Development plc", "METAD Agr…
## $ Certification.Address <chr> "309fcf77415a3661ae83e027f7e5f05dad786e44", "309…
## $ Certification.Contact <chr> "19fef5a731de2db57d16da10287413f5f99bc2dd", "19f…
## $ unit_of_measurement <chr> "m", "m", "m", "m", "m", "m", "m", "m", "m", "m"…
## $ altitude_low_meters <dbl> 1950.0, 1950.0, 1600.0, 1800.0, 1950.0, NA, NA, …
## $ altitude_high_meters <dbl> 2200.0, 2200.0, 1800.0, 2200.0, 2200.0, NA, NA, …
## $ altitude_mean_meters <dbl> 2075.0, 2075.0, 1700.0, 2000.0, 2075.0, NA, NA, …
We will drop the index column. We can also see that some of the values are are stored as “” or ” ” so it is best to convert them into NAs.
# Drop the index in the data set
coffee <- coffee[, -1]
## Remove empty spaces and replace with NA
coffee[coffee == ""] <- NA
coffee[coffee == " "] <- NA
We will also make sure that there are no duplicates by using the distinct function.
coffee <- distinct(coffee)
In the following section
During the preliminary exploratory data analysis a value of 0 in the Total.Cup.Point was skewing the variable.
glimpse(coffee[coffee$Total.Cup.Points == 0, ])
## Rows: 1
## Columns: 43
## $ Species <chr> "Arabica"
## $ Owner <chr> "bismarck castro"
## $ Country.of.Origin <chr> "Honduras"
## $ Farm.Name <chr> "los hicaques"
## $ Lot.Number <chr> "103"
## $ Mill <chr> "cigrah s.a de c.v."
## $ ICO.Number <chr> "13-111-053"
## $ Company <chr> "cigrah s.a de c.v"
## $ Altitude <chr> "1400"
## $ Region <chr> "comayagua"
## $ Producer <chr> "Reinerio Zepeda"
## $ Number.of.Bags <int> 275
## $ Bag.Weight <chr> "69 kg"
## $ In.Country.Partner <chr> "Instituto Hondureño del Café"
## $ Harvest.Year <chr> "2017"
## $ Grading.Date <chr> "April 28th, 2017"
## $ Owner.1 <chr> "Bismarck Castro"
## $ Variety <chr> "Caturra"
## $ Processing.Method <chr> NA
## $ Aroma <dbl> 0
## $ Flavor <dbl> 0
## $ Aftertaste <dbl> 0
## $ Acidity <dbl> 0
## $ Body <dbl> 0
## $ Balance <dbl> 0
## $ Uniformity <dbl> 0
## $ Clean.Cup <dbl> 0
## $ Sweetness <dbl> 0
## $ Cupper.Points <dbl> 0
## $ Total.Cup.Points <dbl> 0
## $ Moisture <dbl> 0.12
## $ Category.One.Defects <int> 0
## $ Quakers <int> 0
## $ Color <chr> "Green"
## $ Category.Two.Defects <int> 2
## $ Expiration <chr> "April 28th, 2018"
## $ Certification.Body <chr> "Instituto Hondureño del Café"
## $ Certification.Address <chr> "b4660a57e9f8cc613ae5b8f02bfce8634c763ab4"
## $ Certification.Contact <chr> "7f521ca403540f81ec99daec7da19c2788393880"
## $ unit_of_measurement <chr> "m"
## $ altitude_low_meters <dbl> 1400
## $ altitude_high_meters <dbl> 1400
## $ altitude_mean_meters <dbl> 1400
Total.Cup.Points is a summation of Aroma, Flavor, Aftertaste, Acidity, Body, Balance, Uniformity, Clean.Cup, Sweetness and Cupper.Points. When inspecting the observation it is evident that the data that is used to calculate it are also zero. Thus it would be important to remove the value.
coffee <- coffee[coffee$Total.Cup.Points != 0, , drop = FALSE]
Using the Q-Grading standards, this conversion method the Total.Cup.Points will be converted to four different categories: “Commodity,” “Very Good,” “Excellent,” and the esteemed “Outstanding.” For further clarification on how the process is undertaken and why we have chosen these values click here
The classification process uses a cutoff point of 79 to distinguish “Commodity” grade coffees, meaning those that score lower or are not considered specialty coffee. “Very Good” covers a range of excellent coffees, from 80 to 84.99, and “Excellent” covers a range of 85 to 90, indicating very good qualities. Anything above 90 is classified as “Outstanding”. This enable us to understand the distribution of coffee grades and helps stakeholders identify the differences in quality.
# Conditions for the corresponding labels
conditions <- c(-Inf, 79, 84.99, 90, Inf)
new_labels <- c("Commodity", "Very Good", "Excellent", "Outstanding")
# The column Grade will be based on the condition
coffee$Grade <- cut(coffee$Total.Cup.Points, breaks = conditions, labels = new_labels, right = TRUE)
During the preliminary stages of our exploratory data analysis, a careful review of the data set revealed discrepancies concerning the “altitude_mean_meters” variable. The main difference was that some entries were registered in feet, which introduced inconsistencies. Extensive examination, involving cross-referencing with the unprocessed data set, revealed that these entries were initially registered in feet(ft) instead of undergoing conversion before being registered in altitude_mean_meters.
Moreover, patterns of observation from Brazil and Myanmar were noticed. For example records in Myanmar that were handled by the Coffee Quality Institute were recorded in feet. But they were not appropriately converted before being registered in the altitude_mean_meter column. Furthermore, records from Brazil were recorded as 1 for 1000 meters and 1.2 for 1200 meters. Cross-referencing with the original data set and other resources a correction step was undertaken.
In other instances, when the conversion from the variable Altitude to the altitude_mean_meters was conducted the values were stripped off the decimal point and recorded as 12 in the altitude_mean_meters process.These values were multiplied by 100 to get the correct value. As a result, the purpose of this phase is to address and resolve these found discrepancies, guaranteeing a more accurate and robust analysis.
# Make a copy of the variable and keep in the column altitude_mean_meters_new
coffee$altitude_mean_meters_new = coffee$altitude_mean_meters
We create a copy of a altitude_mean_meters in order to compare and understand where the discrepancies have occurred. Then we will make changes to the new variable altitude_mean_meters_new for the exploratory and analytic purposes. It is important to remember that altitude_mean_meters is a derivation of the altitude_high_meters and altitude_low_meters. These variables are all derived from the column Altitude which was recorded in a hyphenated to depict the range, recorded in feet or recorded abbreviation such as 1.2 for 1200 meters.The column is not consistent and the will be cleaned in this section.
# There was an error in registering the value 190164 by missing a decimal
print(subset(coffee, altitude_mean_meters == 190164, select = c("altitude_mean_meters", "Altitude", "Country.of.Origin","Owner")))
## altitude_mean_meters Altitude Country.of.Origin Owner
## 897 190164 190164 Guatemala juan luis alvarado romero
## 1145 190164 1901.64 Guatemala juan luis alvarado romero
# Correction of error
coffee$altitude_mean_meters_new = ifelse(coffee$altitude_mean_meters_new == 190164, 1901.64, coffee$altitude_mean_meters_new)
We can see that one of the values was properly recorded in the Altitude column, where as the other one was not. But when the values was being processed into the altitude_mean_meters we can see that an error occurred for at least one of them. Furthermore we inspected the region and that saw the altitude was not 190164. So this is corrected in the column altitude_mean_meters_new.
# Create a function to convert the instances where coffee is registered in feet to meters
feet_to_meters <- function(feet) {
conversion_factor <- 0.3048
meters <- feet * conversion_factor
return(meters)
}
# Altitude that was registered by the Coffee Quality Institution and the Country of origin was Myanmar were recorded in feet
condition_1 = coffee$In.Country.Partner == "Coffee Quality Institute"
condition_2 = coffee$Country.of.Origin == "Myanmar"
print(subset(coffee, condition_1 & condition_2, select = c("altitude_mean_meters", "Altitude", "In.Country.Partner")))
## altitude_mean_meters Altitude In.Country.Partner
## 841 4001.0 4001 Coffee Quality Institute
## 916 1219.2 4000 ft Coffee Quality Institute
## 1018 1066.8 3500 ft Coffee Quality Institute
## 1039 3825.0 3825 Coffee Quality Institute
## 1074 3800.0 3800 Coffee Quality Institute
## 1099 4287.0 4287 Coffee Quality Institute
## 1124 3845.0 3845 Coffee Quality Institute
As mentioned earlier, the observations that were composed of the country Myanmar and the Coffee Quality Institute the Altitude was recorded in feet and was stored in altitude_mean_meters column without conversion. In this step we will convert them to feet using the function created above.
# Using the condition to convert from feet to meters
coffee$altitude_mean_meters_new[condition_1 & condition_2] = feet_to_meters(coffee$altitude_mean_meters_new[condition_1 & condition_2])
We have used the condition specified as Myanmar and Coffee Quality Institute to convert these values into the appropriate units.
# Checking the rows
print(coffee[c(216, 838, 1002, 1270), c("Altitude", "altitude_mean_meters", "In.Country.Partner", "Country.of.Origin")])
## Altitude altitude_mean_meters In.Country.Partner
## 216 3280 3280 Asociacion Nacional Del Café
## 838 3280 3280 Asociacion Nacional Del Café
## 1002 3280 3280 Asociacion Nacional Del Café
## 1270 3500 3500 Specialty Coffee Association
## Country.of.Origin
## 216 Guatemala
## 838 Guatemala
## 1002 Guatemala
## 1270 Indonesia
The following rows are also recorded in feet but did not go the apropriate conversion.
# Converting the values to feet
rows_to_update <- c(216, 838, 1002, 1270)
coffee$altitude_mean_meters_new[rows_to_update] <- feet_to_meters(coffee$altitude_mean_meters_new[rows_to_update])
When inspecting the the data set we saw that Guatemala recorded their Altitude in both feet and meters, although most went the appropriate conversion, there were these few that remained.
# These rows had some errors when recording the altitude_mean_meter column
print(coffee[c(544, 629, 1041,1204), c("Altitude", "altitude_mean_meters")])
## Altitude altitude_mean_meters
## 544 11000 metros 11000
## 629 1800 meters (5900 3850
## 1041 1100.00 mosl 110000
## 1204 12oo 12
These values are recorded in meters but there was decimals not being handled appropriately during the conversion stage and one was a data entry error which is associated to the region of cerrados. We know from the data and other resources that the value 11000 is not plausible, and thus conclude that this is a data entry error and correct the value appropriately.
# Correction of data entry and conversion errors
coffee$altitude_mean_meters_new[c(544, 629, 1041, 1204)] <- c(1100, 1800, 1100, 1200)
We have inputted the correct values and fixed and error that was prevalent.
# The following rows were identified
print(coffee[c(482, 280, 614, 684, 738, 762, 781, 839, 840, 878, 964), c("Altitude", "altitude_mean_meters", "Country.of.Origin", "Region")])
## Altitude altitude_mean_meters Country.of.Origin Region
## 482 1 1 Brazil south of minas
## 280 1 1 Brazil south of minas
## 614 1 1 Brazil south of minas
## 684 1 1 Brazil south of minas
## 738 1 1 Brazil south of minas
## 762 1 1 Brazil south of minas
## 781 1 1 Brazil south of minas
## 839 1 1 Brazil south of minas
## 840 1 1 Brazil south of minas
## 878 1 1 Brazil south of minas
## 964 1 1 Brazil south of minas
Recorded as 1 to represent 1000 meters all of the values are from Brazil of the south minas region. From resources we know from the data and other resources that south of minas has an average altitude of 950m.
# We will be storing the rows to be updated and multiply the values by 1000
rows_to_update = c(482, 280, 614, 684, 738, 762, 781, 839, 840, 878, 964)
coffee$altitude_mean_meters_new[rows_to_update] = coffee$altitude_mean_meters_new[rows_to_update] * 1000
We have thus multipled the values with 1000 to convert the single digit recordings into the meters.
# The following rows were converted from 1.2 which represent 1200 meters to 12
# when cleaning the data
print(coffee[c(42, 43, 786, 899), c("Altitude", "altitude_mean_meters", "Country.of.Origin", "Region")])
## Altitude altitude_mean_meters Country.of.Origin
## 42 1.2 12 Brazil
## 43 1.2 12 Brazil
## 786 1.3 13 Costa Rica
## 899 1.3 13 Costa Rica
## Region
## 42 sul de minas - carmo de minas
## 43 sul de minas - carmo de minas
## 786 turrialba
## 899 turrialba
We can see from the result that when the Altitude column was recorded as an abbreviation and thus when the conversion process was conducted it stored these values by removing the decimal point. We will multiply them with 100 and get the appropriate value in terms of meters.
rows_to_update_2 = c(42, 43, 786, 899)
coffee$altitude_mean_meters_new[rows_to_update_2] = coffee$altitude_mean_meters_new[rows_to_update_2] * 100
Now that we have finished handling the altitude_mean_meters and stored all the conversions and error correction into our new column altitude_mean_meters_new column. We will be using this variable for the exploration and analysis of the data.
class(coffee$Bag.Weight)
## [1] "character"
unique(coffee$Bag.Weight)
## [1] "60 kg" "1" "30 kg" "69 kg" "1 kg" "2 kg,lbs"
## [7] "6" "3 lbs" "50 kg" "2 lbs" "100 lbs" "15 kg"
## [13] "2 kg" "2" "70 kg" "19200 kg" "5 lbs" "1 kg,lbs"
## [19] "6 kg" "0 lbs" "46 kg" "40 kg" "20 kg" "34 kg"
## [25] "1 lbs" "660 kg" "18975 kg" "12000 kg" "35 kg" "66 kg"
## [31] "80 kg" "132 lbs" "5 kg" "25 kg" "59 kg" "18000 kg"
## [37] "150 lbs" "9000 kg" "18 kg" "10 kg" "29 kg" "1218 kg"
## [43] "4 lbs" "0 kg" "13800 kg" "1500 kg" "24 kg" "80 lbs"
## [49] "8 kg" "3 kg" "350 kg" "67 kg" "4 kg" "55 lbs"
## [55] "100 kg" "130 lbs"
We can see that bag weight is stored as a character with different measurements, thus it is important to convert this into numeric and the same unit measurement. We have used the metric standard and we will convert all of the values that are in pound(lbs) to kilogram(kg) and round them two decimal places. Let us create a new column with all the values in the Bag.Weight column and then change them to numeric and store it as Num.Bag.Weight.
# We will create a new variable called Num.Bag.Weight to convert the characters to numeric value.
coffee$Num.Bag.Weight = coffee$Bag.Weight
Since we know that the value in Bag.Weight is stored in kg or lbs we will be converting the lbs to kg.
#Treating the Bag.Weight variable by converting its lbs value to kg and then to numeric
# Identify values in pounds
is_lbs <- grepl(" lbs", coffee$Num.Bag.Weight)
# Identify values in kg
is_kg <-grepl(" kg", coffee$Num.Bag.Weight)
# Remove " lbs" from variable
coffee$Num.Bag.Weight[is_lbs] <- gsub(" lbs", "", coffee$Num.Bag.Weight[is_lbs])
# Remove " kg" from variable
coffee$Num.Bag.Weight[is_kg] <- gsub(" kg", "", coffee$Num.Bag.Weight[is_kg])
# Convert pound values to kg and store them back as numeric value
coffee$Num.Bag.Weight[is_lbs] <- as.numeric(coffee$Num.Bag.Weight[is_lbs]) * 0.45359237
# Convert kg values to numeric
coffee$Num.Bag.Weight[is_kg] <- as.numeric(coffee$Num.Bag.Weight[is_kg])
## Warning: NAs introduced by coercion
# Round the values to decimal places
coffee$Num.Bag.Weight <- round(as.numeric(coffee$Num.Bag.Weight), 2)
Recognizing that there is a relationship between number of bags and bag weight, we can combine these two values in order to give us the number of weight that is sampled. This could help us reduce redundancy and inspect if there is a relationship between the quantity sample and the quality of coffee.
coffee$Total.Weight = (coffee$Num.Bag.Weight * coffee$Number.of.Bags)
coffee_clean <- dplyr::select(coffee, Grade, Total.Cup.Points, Aroma, Flavor, Aftertaste, Acidity, Body, Balance, Uniformity, Clean.Cup,
Sweetness, Cupper.Points, Category.One.Defects,Category.Two.Defects, Quakers, Moisture, altitude_mean_meters_new, Total.Weight, Variety, Color, Processing.Method, Country.of.Origin, In.Country.Partner)
This will be the data set that will be used for further exploration and data analysis. We will use the variables we created such as Grade and Total.Weight. Also we will include the variables that went the corrective process such as altitude_mean_meters_new.
glimpse(coffee_clean)
## Rows: 1,310
## Columns: 23
## $ Grade <fct> Outstanding, Excellent, Excellent, Excellent,…
## $ Total.Cup.Points <dbl> 90.58, 89.92, 89.75, 89.00, 88.83, 88.83, 88.…
## $ Aroma <dbl> 8.67, 8.75, 8.42, 8.17, 8.25, 8.58, 8.42, 8.2…
## $ Flavor <dbl> 8.83, 8.67, 8.50, 8.58, 8.50, 8.42, 8.50, 8.3…
## $ Aftertaste <dbl> 8.67, 8.50, 8.42, 8.42, 8.25, 8.42, 8.33, 8.5…
## $ Acidity <dbl> 8.75, 8.58, 8.42, 8.42, 8.50, 8.50, 8.50, 8.4…
## $ Body <dbl> 8.50, 8.42, 8.33, 8.50, 8.42, 8.25, 8.25, 8.3…
## $ Balance <dbl> 8.42, 8.42, 8.42, 8.25, 8.33, 8.33, 8.25, 8.5…
## $ Uniformity <dbl> 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, 10.…
## $ Clean.Cup <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1…
## $ Sweetness <dbl> 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, 10.…
## $ Cupper.Points <dbl> 8.75, 8.58, 9.25, 8.67, 8.58, 8.33, 8.50, 9.0…
## $ Category.One.Defects <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Category.Two.Defects <int> 0, 1, 0, 2, 2, 1, 0, 0, 0, 4, 1, 0, 0, 2, 2, …
## $ Quakers <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Moisture <dbl> 0.12, 0.12, 0.00, 0.11, 0.12, 0.11, 0.11, 0.0…
## $ altitude_mean_meters_new <dbl> 2075.0, 2075.0, 1700.0, 2000.0, 2075.0, NA, N…
## $ Total.Weight <dbl> 18000, 18000, 5, 19200, 18000, 3000, 6900, 18…
## $ Variety <chr> NA, "Other", "Bourbon", NA, "Other", NA, "Oth…
## $ Color <chr> "Green", "Green", NA, "Green", "Green", "Blui…
## $ Processing.Method <chr> "Washed / Wet", "Washed / Wet", NA, "Natural …
## $ Country.of.Origin <chr> "Ethiopia", "Ethiopia", "Guatemala", "Ethiopi…
## $ In.Country.Partner <chr> "METAD Agricultural Development plc", "METAD …
Let us examine which values has the most missing values and if we can conduct any simple methods of handling this instances. We will use a simple proportion count using.
# Creating a function that ingests the data set and gives us a summary of the missing values. It will calculate the Count and the Percentage and arrange it in descending order
missing_values_summary <- function(data) {
data %>%
gather(key = "Variables", value = "Values") %>%
group_by(Variables) %>%
summarise(
Count = sum(is.na(Values)),
Percentage = mean(is.na(Values)) * 100
) %>%
arrange(desc(Percentage))
}
missing_values_summary(coffee_clean)
## Warning: attributes are not identical across measure variables; they will be
## dropped
## # A tibble: 23 × 3
## Variables Count Percentage
## <chr> <int> <dbl>
## 1 altitude_mean_meters_new 227 17.3
## 2 Color 216 16.5
## 3 Variety 201 15.3
## 4 Processing.Method 151 11.5
## 5 Total.Weight 2 0.153
## 6 Country.of.Origin 1 0.0763
## 7 Quakers 1 0.0763
## 8 Acidity 0 0
## 9 Aftertaste 0 0
## 10 Aroma 0 0
## # ℹ 13 more rows
The column altitude_mean_meters has the highest level of missing values followed by Color, Variety and Processing Method. For now we will only handle the categorical variables by replacing the missing values with the factor “Unknown”.
# Replace the missing categorical NAs with "Unknown"
coffee_clean <- coffee_clean %>%
mutate(
Color = coalesce(Color, "Unknown"),
Variety = coalesce(Variety, "Unknown"),
Processing.Method = coalesce(Processing.Method, "Unknown"),
)
vis_miss(coffee_clean)
We found one instance where the Country of Origin is missing Region is
there so we will be substituting this variable withe the correct Country
of Origin.
head(coffee[is.na(coffee$Country.of.Origin), ])
## Species Owner Country.of.Origin Farm.Name Lot.Number Mill
## 1198 Arabica racafe & cia s.c.a <NA> <NA> <NA> <NA>
## ICO.Number Company Altitude Region Producer Number.of.Bags Bag.Weight
## 1198 3-37-1980 <NA> <NA> <NA> <NA> 149 70 kg
## In.Country.Partner Harvest.Year Grading.Date Owner.1 Variety
## 1198 Almacafé <NA> March 1st, 2011 Racafe & Cia S.C.A <NA>
## Processing.Method Aroma Flavor Aftertaste Acidity Body Balance Uniformity
## 1198 <NA> 6.75 6.75 6.42 6.83 7.58 7.5 10
## Clean.Cup Sweetness Cupper.Points Total.Cup.Points Moisture
## 1198 10 10 7.25 79.08 0.1
## Category.One.Defects Quakers Color Category.Two.Defects
## 1198 0 0 <NA> 3
## Expiration Certification.Body
## 1198 February 29th, 2012 Almacafé
## Certification.Address
## 1198 e493c36c2d076bf273064f7ac23ad562af257a25
## Certification.Contact unit_of_measurement
## 1198 70d3c0c26f89e00fdae6fb39ff54f0d2eb1c38ab m
## altitude_low_meters altitude_high_meters altitude_mean_meters Grade
## 1198 NA NA NA Very Good
## altitude_mean_meters_new Num.Bag.Weight Total.Weight
## 1198 NA 70 10430
coffee_clean$Country.of.Origin <- ifelse(is.na(coffee_clean$Country.of.Origin), "Colombia",coffee_clean$Country.of.Origin)
The variables Variety, Color, Processing.Methods, Country.of.Origin and In.Country.Partner are stored as character and should be converted to factor.
convert_characters_to_factors <- function(data) {
# Identify character columns
char_cols <- sapply(data, is.character)
# Convert character columns to factors
data[char_cols] <- lapply(data[char_cols], as.factor)
# Return the modified data frame
return(data)
}
coffee_clean <- convert_characters_to_factors(coffee_clean)
glimpse(coffee_clean)
## Rows: 1,310
## Columns: 23
## $ Grade <fct> Outstanding, Excellent, Excellent, Excellent,…
## $ Total.Cup.Points <dbl> 90.58, 89.92, 89.75, 89.00, 88.83, 88.83, 88.…
## $ Aroma <dbl> 8.67, 8.75, 8.42, 8.17, 8.25, 8.58, 8.42, 8.2…
## $ Flavor <dbl> 8.83, 8.67, 8.50, 8.58, 8.50, 8.42, 8.50, 8.3…
## $ Aftertaste <dbl> 8.67, 8.50, 8.42, 8.42, 8.25, 8.42, 8.33, 8.5…
## $ Acidity <dbl> 8.75, 8.58, 8.42, 8.42, 8.50, 8.50, 8.50, 8.4…
## $ Body <dbl> 8.50, 8.42, 8.33, 8.50, 8.42, 8.25, 8.25, 8.3…
## $ Balance <dbl> 8.42, 8.42, 8.42, 8.25, 8.33, 8.33, 8.25, 8.5…
## $ Uniformity <dbl> 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, 10.…
## $ Clean.Cup <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1…
## $ Sweetness <dbl> 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, 10.…
## $ Cupper.Points <dbl> 8.75, 8.58, 9.25, 8.67, 8.58, 8.33, 8.50, 9.0…
## $ Category.One.Defects <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Category.Two.Defects <int> 0, 1, 0, 2, 2, 1, 0, 0, 0, 4, 1, 0, 0, 2, 2, …
## $ Quakers <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Moisture <dbl> 0.12, 0.12, 0.00, 0.11, 0.12, 0.11, 0.11, 0.0…
## $ altitude_mean_meters_new <dbl> 2075.0, 2075.0, 1700.0, 2000.0, 2075.0, NA, N…
## $ Total.Weight <dbl> 18000, 18000, 5, 19200, 18000, 3000, 6900, 18…
## $ Variety <fct> Unknown, Other, Bourbon, Unknown, Other, Unkn…
## $ Color <fct> Green, Green, Unknown, Green, Green, Bluish-G…
## $ Processing.Method <fct> Washed / Wet, Washed / Wet, Unknown, Natural …
## $ Country.of.Origin <fct> "Ethiopia", "Ethiopia", "Guatemala", "Ethiopi…
## $ In.Country.Partner <fct> METAD Agricultural Development plc, METAD Agr…
Evaluating Total.Cup.Points we will use the describe function in order to understand the summary of the data set in terms of mean, median, its skeweness and kurtosis. In addition, we will use a function to combined plots of a density, histogram and a box plot to get a visual understanding of how our target variable is composed.
# Create a function that enables you to plot a density, histogram, and a box plot. This will be predominantly used for the numerical variables.
plot_combined_stats <- function(data, variable, bins = 30) {
# Create the density plot
density_plot <- ggplot(data, aes(x = .data[[variable]])) +
geom_density(fill = "orange", alpha = 0.6) +
stat_function(fun = dnorm, args = list(mean = mean(data[[variable]]), sd = sd(data[[variable]])),
color = "blue", linetype = "dashed") +
theme_minimal() +
labs(x = variable, y = "Density") +
theme(
axis.title.y = element_text(vjust = 0.5, hjust = -0.2, angle = 0),
axis.title.x = element_text(hjust = 0.5)
)
# Create the histogram
histogram <- ggplot(data, aes(x = .data[[variable]])) +
geom_histogram(binwidth = bins, fill = "orange", color = "black") +
theme_minimal() +
theme(
panel.background = element_rect(fill = "white")
) +
labs(x = variable, y = "Count") +
theme_classic()
# Create the boxplot
boxplot <- ggplot(data, aes(y = .data[[variable]])) +
geom_boxplot(fill = "orange", color = "black") +
theme_minimal() +
theme(
panel.background = element_rect(fill = "white")
) +
labs(x = "", y = variable) +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
# Combine the plots using cowplot
combined_plot <- plot_grid(density_plot, histogram, boxplot, ncol = 3)
# Output the combined plot
print(combined_plot)
}
Examining the
describe(coffee_clean$Total.Cup.Points)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 1310 82.18 2.69 82.5 82.41 1.85 59.83 90.58 30.75 -1.84 8.99
## se
## X1 0.07
total_cup_points_plot = plot_combined_stats(coffee_clean, "Total.Cup.Points", bins =1)
The describe function indicates that the mean and the median of
Total.Cup.Points are close to each other which indicates symmetry in
distribution. The standard deviation is 2.68 indicating low variability
from the mean. Kurtosis is greater than 8.99 suggesting fatter tails and
the skewness of -1.84 indicates its is negatively skewed.
Just by visualizing we can see that there is a peak of the Total.Cup.Points a little above the score of 80 and we can also see that the variable has many outliers, especially on the lower whisker. We can see lowest value is 60 for the score and this skews sighyly to the left. There don’t seem to be much variability in the data from the mean, such that the values are concentrated from teh range of 82 to 83, with 50% of the observation falling within this range.
Examining the distribution of the Grade variables also important. Since this is a categorical variable we will conduct frequency count and plot a barchart.
table(coffee_clean$Grade)
##
## Commodity Very Good Excellent Outstanding
## 112 1092 105 1
# Calculate the percentage of each grade
grade_percentages <- prop.table(table(coffee$Grade)) * 100
# Create a data frame with grade levels and percentages
grade_data <- data.frame(Grade = names(grade_percentages),
Percentage = as.numeric(grade_percentages))
# Reorder grade levels based on percentage
grade_data$Grades <- factor(grade_data$Grade, levels = grade_data$Grade[order(-grade_data$Percentage)])
# Create the plot using ggplot2
barplot_grades <- ggplot(grade_data, aes(x = Grades, y = Percentage, fill = Grades)) +
geom_bar(stat = "identity", alpha = 0.5, fill = "orange") +
labs(title = "Distribution of Coffee Grades",
x = "Grade", y = "Percentage") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none")
# Display the plot
print(barplot_grades)
Grade is imbalanced, we can see that 80% of the observations fall under the “Very Good” factor level. Where as the “Outstanding” factor has only one observation. This create problems when conducting analysis because it can cause over fitting, poor minority class detection, and mislead our metric evaluation when checking the fitness of our model.Before building our model we must handle such imbalances, with different techniques.
We will be using various plotting function in order to create a visual understanding of the categorical variables. We will examine the frequency, the different grades that the factors of a categorical variable have, and we will see the distribution of the Total.Cup.Points for each category for any insight.
# This is a simple plot that is based on frequency
barplot_categorical <- function(data, category_variable, top_n) {
category_variable <- rlang::sym(category_variable)
data %>%
mutate(!!rlang::quo_name(category_variable) := as.character(!!category_variable)) %>%
group_by(!!category_variable) %>%
summarise(n = n(), .groups = "drop") %>%
arrange(desc(n)) %>%
slice_head(n = top_n) %>%
mutate(!!rlang::quo_name(category_variable) := fct_reorder(!!category_variable, n)) %>%
ggplot(aes(x = fct_reorder(!!category_variable, n), y = n, fill = "orange")) +
geom_bar(stat = "identity", show.legend = FALSE) +
labs(title = paste("Top", top_n, rlang::quo_name(category_variable)),
x = rlang::quo_name(category_variable),
y = "Frequency") +
scale_fill_manual(values = "orange") +
theme(
panel.background = element_rect(fill = "white"),
axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1, size = 15)) +
coord_flip()
}
# This is a category by category plot that will show us the different grades that are within the category variable. We can see which one is dominant and which one is present in the categorical variables we inspect.
barplot_by_grade <- function(data, category_variable, fill_variable, top_n) {
category_variable <- rlang::sym(category_variable)
fill_variable <- rlang::sym(fill_variable)
data %>%
mutate(
!!rlang::quo_name(category_variable) := as.character(!!category_variable),
!!rlang::quo_name(fill_variable) := as.factor(!!fill_variable)
) %>%
group_by(!!category_variable, !!fill_variable) %>%
summarise(n = n(), .groups = "drop") %>%
arrange(desc(n)) %>%
group_by(!!category_variable) %>%
mutate(total_freq = sum(n)) %>%
ungroup() %>%
mutate(!!rlang::quo_name(category_variable) := fct_reorder(!!category_variable, total_freq)) %>%
ggplot(aes(x = !!category_variable, y = n, fill = !!fill_variable)) +
geom_bar(stat = "identity") +
labs(
title = paste("Top", top_n, category_variable),
x = rlang::quo_name(category_variable),
y = "Frequency"
) +
scale_fill_manual(values = scales::brewer_pal(palette = "Oranges")(length(unique(data[[fill_variable]])))) +
theme(
panel.background = element_rect(fill = "white"),
axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1, size = 15),
#axis.title.x = element_text(size = 15),
#axis.title.y = element_text(size = 15),
#legend.title = element_text(size = 15),
#legend.text = element_text(size = 12),
legend.position = "right",
axis.ticks.length = unit(0.2, "cm"),
legend.background = element_rect(fill = "white", color = "black")
) +
coord_flip()
}
# This plot displays the range, the mean and the median of the Total.Cup.Points for the category we will be inspecting.
plot_summary_statistics <- function(data, categorical_variable, numerical_variable) {
ggplot(data = data) +
stat_summary(
mapping = aes(x = !!rlang::sym(categorical_variable), y = !!rlang::sym(numerical_variable)),
fun.min = min,
fun.max = max,
fun = median,
geom = "linerange",
color = "orange",
size = 0.5
) +
stat_summary(
mapping = aes(x = !!rlang::sym(categorical_variable), y = !!rlang::sym(numerical_variable)),
fun = mean,
geom = "point", # Use point for the mean
shape = 4, # Set point shape to X
color = "blue", # Set point color to blue
size = 1.5 # Adjust point size
) +
stat_summary(
mapping = aes(x = !!rlang::sym(categorical_variable), y = !!rlang::sym(numerical_variable)),
fun = median,
geom = "point", # Use point for the median dot
color = "black", # Set dot color to black
size = 1.5 # Adjust dot size
) +
theme_minimal() + # Set a minimal theme
theme(
panel.background = element_rect(fill = "white")
) +
coord_flip() # Flip the coordinates
}
# Visualizing the distribution, median and central tendency of the Total.Cup.Point to each categorical variables with the outliers.
boxplot_without_outliers <- function(data, categorical_variable, numerical_variable) {
ggplot(data = data) +
geom_boxplot(
mapping = aes_string(x = categorical_variable, y = numerical_variable),
color = "orange", # Set box color to orange
fill = "white", # Set fill color to white
width = 0.75,
outlier.shape = NA # Removing outliers
) +
stat_summary(
mapping = aes_string(x = categorical_variable, y = numerical_variable),
fun = median,
geom = "point", # Median will be represented by a black dot
color = "black",
size = 1.5
) +
theme_minimal() +
theme(
panel.background = element_rect(fill = "white"),
axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1, size = 15)
) +
coord_flip()
}
# Visualizing the distribution, median and central tendency of the Total.Cup.Point to each categorical variables with the outliers
boxplot_with_outliers <- function(data, categorical_variable, numerical_variable) {
ggplot(data = data) +
geom_boxplot(
mapping = aes_string(x = categorical_variable, y = numerical_variable),
color = "orange", # Set box color to orange
fill = "white", # Set fill color to white
width = 0.75 # Adjust box width if needed
) +
stat_summary(
mapping = aes_string(x = categorical_variable, y = numerical_variable),
fun = median,
geom = "point", # Use point for the median dot
position = position_dodge(width = 0.75), # Adjust dodge width if needed
color = "black", # Set dot color to black
size = 1.5 # Adjust dot size
) +
theme_minimal() + # Set a minimal theme
theme(
panel.background = element_rect(fill = "white"),
axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1, size = 15) # Set background to white
) +
coord_flip() # Flip the coordinates
}
#This function plots the mean of the Total.Cup.Points grouped by the categorical variables.
plot_mean_by_category <- function(data, category_variable, numerical_variable) {
# Ensure the variables are symbols
category_variable <- rlang::sym(category_variable)
numerical_variable <- rlang::sym(numerical_variable)
# Convert the numerical variable to numeric, and handle non-numeric values
data <- data %>%
mutate({{ numerical_variable }} := as.numeric(as.character({{ numerical_variable }})))
# If there are any NAs introduced by coercion, replace them with the mean of the non-NA values
if (any(is.na(data[[rlang::quo_name(numerical_variable)]]))) {
mean_value <- mean(data[[rlang::quo_name(numerical_variable)]], na.rm = TRUE)
data <- data %>%
mutate({{ numerical_variable }} := ifelse(is.na({{ numerical_variable }}), mean_value, {{ numerical_variable }}))
}
# Now you can summarise and plot your data with coord_flip()
data %>%
group_by({{ category_variable }}) %>%
summarise(mean_value = mean({{ numerical_variable }}, na.rm = TRUE)) %>%
arrange(desc(mean_value)) %>%
ggplot(aes(x = reorder({{ category_variable }}, mean_value), y = mean_value)) +
stat_summary(
fun = "mean",
geom = "point",
color = "orange",
size = 2
) +
theme_minimal() +
theme(
panel.background = element_rect(fill = "white"),
axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1, size = 15)
) +
coord_flip() # Add coord_flip() to flip the plot
}
unique(coffee_clean$Variety)
## [1] Unknown Other Bourbon
## [4] Catimor Ethiopian Yirgacheffe Caturra
## [7] SL14 Sumatra SL34
## [10] Hawaiian Kona Yellow Bourbon SL28
## [13] Gesha Catuai Pacamara
## [16] Typica Sumatra Lintong Mundo Novo
## [19] Java Peaberry Pacas
## [22] Mandheling Ruiru 11 Arusha
## [25] Ethiopian Heirlooms Moka Peaberry Sulawesi
## [28] Blue Mountain Marigojipe Pache Comun
## 30 Levels: Arusha Blue Mountain Bourbon Catimor Catuai ... Yellow Bourbon
length(unique(coffee_clean$Variety))
## [1] 30
We have 30 different types of variety in the data set. We will see which one is the most frequently occurring in other words the most popular variety, and how each variety is decomposed of the different grades. Viewing the distribution of each category with respect to the Total.Cup.Points is also important and finally we will group each category and calculate the mean of the Total.Cup.Points. This will be conducted for all of the categorical variables.
variety0 <- barplot_categorical(coffee_clean, "Variety", 30)
variety1 <- barplot_by_grade(coffee_clean, "Variety", "Grade", 30)
variety2 <- plot_summary_statistics(coffee_clean, "Variety", "Total.Cup.Points")
variety3 <- boxplot_without_outliers(coffee_clean, "Variety", "Total.Cup.Points")
variety4<- boxplot_with_outliers(coffee_clean, "Variety", "Total.Cup.Points")
variety5 <- plot_mean_by_category(coffee_clean, "Variety", "Total.Cup.Points")
combined_plot_variety0 = variety0 + variety5
combined_plot_variety1 = variety2 + variety3
combined_plot_variety2 = variety4 + variety1
print(combined_plot_variety0)
print(combined_plot_variety1)
print(combined_plot_variety2)
We could see that Caturra, Bourbon, Typica are the most occurring variables. We can also see that unknown values and other variety are within the factors. Also the outstanding coffee grade is in the missing variable. The Catuai variety, has the most variability when plotting the varieties against the Total Cup Points. Then followed by Caturra and Bourbon. Ethiopian Yigracheffe has the highest median point compared to the rest. Also a couple of the observation that are from the Unknown, Other and Bourbon, have the high total cup points.
As we saw since the data set is imbalance it would be best to visualize the mean of the different varieties. We can see that Ethiopian Yirgacheffe coffee variety followed by Sumatra Lington, SL34 and SL28 and the Sumatra have a higher mean in total cup points. What is interesting is that Sumatra Linton and its variation SL34, SL28 and Sumatra as well as SL14 all take from top 2 to top 6 of the coffee variety.
One thing that stands out is that there are levels within the categorical variable that do have low occurrence but are pertinent in determining the quality of the coffee. In order to show this let us create a table of the occurrences.
table(coffee_clean$Variety)
##
## Arusha Blue Mountain Bourbon
## 5 2 226
## Catimor Catuai Caturra
## 20 74 255
## Ethiopian Heirlooms Ethiopian Yirgacheffe Gesha
## 1 2 12
## Hawaiian Kona Java Mandheling
## 44 2 3
## Marigojipe Moka Peaberry Mundo Novo
## 1 1 33
## Other Pacamara Pacas
## 108 8 13
## Pache Comun Peaberry Ruiru 11
## 1 5 2
## SL14 SL28 SL34
## 17 15 8
## Sulawesi Sumatra Sumatra Lintong
## 1 3 1
## Typica Unknown Yellow Bourbon
## 211 201 35
For example, the Ethiopian Yigracheffe and the Sumatra Lintong have very few occurrences however they account for the one one of the highest mean of total cup points. This is not exclusive to only high grade coffee, we can also see that Java variety has only two occurrences but its mean total score is not positioned not high relative to other varieties. But beyond the rarity levels with fewer occurrences will pose challenges when splitting the data in the modeling stage. It is important to handle this before moving on the modeling stage. So we take note of this for now to hanle it at the end of this section.
unique(coffee_clean$Color)
## [1] Green Unknown Bluish-Green None Blue-Green
## Levels: Blue-Green Bluish-Green Green None Unknown
length(unique(coffee_clean$Color))
## [1] 5
There are five different unique values for the categorical variable color.
color0 <- barplot_categorical(coffee_clean, "Color", 5)
color1 <- barplot_by_grade(coffee_clean, "Color", "Grade", 5)
color2 <- plot_summary_statistics(coffee_clean, "Color", "Total.Cup.Points")
color3 <- boxplot_without_outliers(coffee_clean, "Color", "Total.Cup.Points")
color4<- boxplot_with_outliers(coffee_clean, "Color", "Total.Cup.Points")
color5 <- plot_mean_by_category(coffee_clean, "Color", "Total.Cup.Points")
combined_plot_color0 = color0 + color1
combined_plot_color1 = color2 + color3
combined_plot_color2 = color4 + color5
print(combined_plot_color0)
print(combined_plot_color1)
print(combined_plot_color2)
The most occurring type of color is green then followed by unknown and then bluish green and blue green. The Bluish Green and the Blue Green account for a small amount of commodity grades within them. The Green color accounts mostly for the Very Good whereas the Excellent are very few in all of the colors. Since outstanding is also barely present in the data we cannot visually tell apart in which color it falls.
As expected that the Green has a lot of variability followed by the None and then the Blue-Green then Bluish Green.The Blue-Green and the Blueish-Green color yield a higher mean total cup points. Followed by the Green and the None.
unique(coffee_clean$Processing.Method)
## [1] Washed / Wet Unknown
## [3] Natural / Dry Pulped natural / honey
## [5] Semi-washed / Semi-pulped Other
## 6 Levels: Natural / Dry Other ... Washed / Wet
length(unique(coffee_clean$Processing.Method))
## [1] 6
There are six different processes used for green beans.
processing_method0 <-barplot_categorical(coffee_clean, "Processing.Method", 6)
processing_method1 <- barplot_by_grade(coffee_clean, "Processing.Method", "Grade", 6)
processing_method2 <- plot_summary_statistics(coffee_clean, "Processing.Method", "Total.Cup.Points")
processing_method3 <- boxplot_without_outliers(coffee_clean, "Processing.Method", "Total.Cup.Points")
processing_method4<- boxplot_with_outliers(coffee_clean, "Processing.Method", "Total.Cup.Points")
processing_method5 <- plot_mean_by_category(coffee_clean, "Processing.Method", "Total.Cup.Points")
combined_plot_processing0 = processing_method0 + processing_method1
combined_plot_processing1 = processing_method2 + processing_method3
combined_plot_processing2 = processing_method4 + processing_method5
print(combined_plot_processing0)
print(combined_plot_processing1)
print(combined_plot_processing2)
The most common method that is used is the Washed/Wet process followed by the Natural Dry. Then there are less than 200 values that are missing within the processing method. Semi-washed or semi-pulped and pulped natural are not very common methods for processing coffee.
We can see that the washed/wet process and the natural dry process produce a fair amount of very good coffee grades. Where as the the missing and the semi-washed or semi pulped have fewer commodity than the rest of the processing method. The other methods pulped natural/honey methods only produce very good coffee.
The pupled/honey yields the highest mean in total cup points followed by the semi-washed and the washed.
unique(coffee_clean$Country.of.Origin)
## [1] Ethiopia Guatemala
## [3] Brazil Peru
## [5] United States United States (Hawaii)
## [7] Indonesia China
## [9] Costa Rica Mexico
## [11] Uganda Honduras
## [13] Taiwan Nicaragua
## [15] Tanzania, United Republic Of Kenya
## [17] Thailand Colombia
## [19] Panama Papua New Guinea
## [21] El Salvador Japan
## [23] Ecuador United States (Puerto Rico)
## [25] Haiti Burundi
## [27] Vietnam Philippines
## [29] Rwanda Malawi
## [31] Laos Zambia
## [33] Myanmar Mauritius
## [35] Cote d?Ivoire India
## 36 Levels: Brazil Burundi China Colombia Costa Rica Cote d?Ivoire ... Zambia
length(unique(coffee_clean$Country.of.Origin))
## [1] 36
There are 37 different countries in the data set.In the levels we can see that the level “Cote d?Ivoire” could be renamed.
country0 <- barplot_categorical(coffee_clean, "Country.of.Origin", 37)
country1 <- barplot_by_grade(coffee_clean, "Country.of.Origin", "Grade", 37)
country2 <- plot_summary_statistics(coffee_clean, "Country.of.Origin", "Total.Cup.Points")
country3 <- boxplot_without_outliers(coffee_clean, "Country.of.Origin", "Total.Cup.Points")
country4 <- boxplot_with_outliers(coffee_clean, "Country.of.Origin", "Total.Cup.Points")
country5 <- plot_mean_by_category(coffee_clean, "Country.of.Origin", "Total.Cup.Points")
combined_plot_country0 = country0 + country5
combined_plot_country1 = country2 + country3
combined_plot_country2 = country4 + country2
print(combined_plot_country0)
print(combined_plot_country1)
print(combined_plot_country2)
The most occurring country in the data set is Mexico, even though it is
not the largest producer of coffee. Followed by Columbia, Guatemala and
Brazil. It is evident that the data set has many South American
countries.When the data is separated it is evident which country has the
outstanding coffee which is Ethiopia. Another interesting insight is
that United States predominantly produces excellent coffee with a small
number graded as very good coffee.
When inspecting the box plot we can see that there is a great variability of grades for different countries. Further more when observing the mean we can see that Unites States, Papua New Guinea and Ethiopia are the top three in terms of high mean total cup points.
table(coffee$Country.of.Origin)
##
## Brazil Burundi
## 132 2
## China Colombia
## 16 183
## Costa Rica Cote d?Ivoire
## 51 1
## Ecuador El Salvador
## 1 21
## Ethiopia Guatemala
## 44 181
## Haiti Honduras
## 6 52
## India Indonesia
## 1 20
## Japan Kenya
## 1 25
## Laos Malawi
## 3 11
## Mauritius Mexico
## 1 236
## Myanmar Nicaragua
## 8 26
## Panama Papua New Guinea
## 4 1
## Peru Philippines
## 10 5
## Rwanda Taiwan
## 1 75
## Tanzania, United Republic Of Thailand
## 40 32
## Uganda United States
## 26 8
## United States (Hawaii) United States (Puerto Rico)
## 73 4
## Vietnam Zambia
## 7 1
As we saw from the visualizations there are some countries with few observation. These countries include Papau New Guinea which has only one observation but is also rated in the top when compared in terms of total cup score mean. On the other hand India has also only one observation but has the lowest mean total score. In this data set these countries are rare countries.
unique(coffee$In.Country.Partner)
## [1] "METAD Agricultural Development plc"
## [2] "Specialty Coffee Association"
## [3] "Specialty Coffee Institute of Asia"
## [4] "Ethiopia Commodity Exchange"
## [5] "Almacafé"
## [6] "Yunnan Coffee Exchange"
## [7] "Blossom Valley International"
## [8] "AMECAFE"
## [9] "NUCOFFEE"
## [10] "Uganda Coffee Development Authority"
## [11] "Instituto Hondureño del Café"
## [12] "Specialty Coffee Association of Costa Rica"
## [13] "Kenya Coffee Traders Association"
## [14] "Africa Fine Coffee Association"
## [15] "Asociacion Nacional Del Café"
## [16] "Centro Agroecológico del Café A.C."
## [17] "Salvadoran Coffee Council"
## [18] "Specialty Coffee Association of Indonesia"
## [19] "Brazil Specialty Coffee Association"
## [20] "Specialty Coffee Ass"
## [21] "Asociación Mexicana De Cafés y Cafeterías De Especialidad A.C."
## [22] "Tanzanian Coffee Board"
## [23] "Central De Organizaciones Productoras De Café y Cacao Del Perú - Central Café & Cacao"
## [24] "Torch Coffee Lab Yunnan"
## [25] "Coffee Quality Institute"
## [26] "Asociación de Cafés Especiales de Nicaragua"
## [27] "Blossom Valley International\n"
length(unique(coffee$In.Country.Partner))
## [1] 27
There are 27 in country partners. When evaluating the names we see that there are some levels that are the same and could be renamed in order to avoid redundancy. We can see that for example that Blossom Valley International appears twice, but due to difference in spelling or data entry error it is being considered as two levels. Another similar instance is for a level registered as “Specialty Coffee Ass” which needs further investigation to pinpoint it could be renamed to since countries have their own SCA.
partner0 <- barplot_categorical(coffee_clean, "In.Country.Partner", 27)
partner1 <- barplot_by_grade(coffee_clean, "In.Country.Partner", "Grade", 27)
partner2 <- plot_summary_statistics(coffee_clean, "In.Country.Partner", "Total.Cup.Points")
partner3 <- boxplot_without_outliers(coffee_clean, "In.Country.Partner", "Total.Cup.Points")
partner4 <- boxplot_with_outliers(coffee_clean, "In.Country.Partner", "Total.Cup.Points")
partner5 <- plot_mean_by_category(coffee, "In.Country.Partner", "Total.Cup.Points")
combined_plot_partner0 = partner0 + partner1
combined_plot_partner1 = partner2 + partner3
combined_plot_partner2 = partner4 + partner5
print(combined_plot_partner0)
print(combined_plot_partner1)
print(combined_plot_partner2)
table(coffee_clean$In.Country.Partner)
##
## Africa Fine Coffee Association
## 49
## Almacafé
## 178
## AMECAFE
## 205
## Asociación de Cafés Especiales de Nicaragua
## 8
## Asociación Mexicana De Cafés y Cafeterías De Especialidad A.C.
## 6
## Asociacion Nacional Del Café
## 155
## Blossom Valley International
## 58
## Blossom Valley International\n
## 1
## Brazil Specialty Coffee Association
## 67
## Central De Organizaciones Productoras De Café y Cacao Del Perú - Central Café & Cacao
## 1
## Centro Agroecológico del Café A.C.
## 8
## Coffee Quality Institute
## 7
## Ethiopia Commodity Exchange
## 18
## Instituto Hondureño del Café
## 59
## Kenya Coffee Traders Association
## 22
## METAD Agricultural Development plc
## 15
## NUCOFFEE
## 36
## Salvadoran Coffee Council
## 11
## Specialty Coffee Ass
## 1
## Specialty Coffee Association
## 295
## Specialty Coffee Association of Costa Rica
## 42
## Specialty Coffee Association of Indonesia
## 10
## Specialty Coffee Institute of Asia
## 16
## Tanzanian Coffee Board
## 6
## Torch Coffee Lab Yunnan
## 2
## Uganda Coffee Development Authority
## 22
## Yunnan Coffee Exchange
## 12
We see the same issue with the categorical variables for the partners as we saw in the variety, some of the levels have few occurrences. For example “Central De Organizaciones Productoras De Café y Cacao Del Perú - Central Café & Cacao” level has only one observation and this will also pose challenges when splitting the data thus it must be reviewed carefully.
In this section we will use histogram, density plot and box plot to see the structure of the numerical variables.
coffee_long <- coffee_clean %>%
mutate(across(everything(), as.character)) %>%
pivot_longer(cols = everything(),
names_to = "variable",
values_to = "value") %>%
filter(!is.na(as.numeric(value)))
The code above transforms the data frame into a long format, this means that each row will be a single observation for a variable, then it proceeded to filter out the rows that cannot be converted into numeric, in other words drops the categorical variables. Based on this transformation we will conduct our visual representation for the numeric variables.
coffee_histograms <- coffee_long %>%
filter(!is.na(as.numeric(value))) %>%
mutate(value = as.numeric(value)) %>%
ggplot(aes(x = value)) +
geom_histogram(bins = 20, color = "black", fill = "orange") +
facet_wrap(~variable, scales = "free", ncol = 4) +
labs(title = "Histograms of Numeric Variables",
x = "Value",
y = "Frequency") +
theme_minimal()
print(coffee_histograms)
coffee_density <- coffee_long %>%
filter(!is.na(value) & is.finite(as.numeric(value))) %>%
ggplot(aes(x = as.numeric(value))) +
geom_density(fill = "orange", color = "black") +
facet_wrap(~variable, scales = "free", ncol = 4) +
labs(title = "Density Plots of Numeric Variables with Normality Trace",
x = "Value",
y = "Density") +
theme_minimal()
print(coffee_density)
coffee_box_plots <- coffee_long %>%
filter(!is.na(value) & is.finite(as.numeric(value))) %>%
ggplot(aes(y = as.numeric(value))) +
geom_boxplot(fill = "orange", color = "black") +
facet_wrap(~variable, scales = "free", ncol = 4) +
labs(title = "Box Plots of Numeric Variables with Normality Trace",
x = "",
y = "Value") +
theme_minimal()
print(coffee_box_plots)
From the histogram we can see that the variables such as
Category.One.Defects, Category.Two.Defects, Quakers and Total.Weight is
right skewed with values concentrated on the smaller values. On the
other hand Sweetness, Uniformity and Clean Cup are left skewed with
values concentrated on the right side of the graph. Furthermore we can
see that Moisture is bi-modal indicating there are two distinct groups
with different coffee moisture levels. This is also prevalent when
visualizing the box plot. We can see that we have many of the variables
that have persistent outliers. Another thing that stands out the
variable altitude_mean_meters_new has low values that are close to zero.
This is worth investigating because altitude mean meters close to zero
does not make sense.
print(subset(coffee, altitude_mean_meters_new < 100, select = c("Country.of.Origin", "altitude_mean_meters_new", "In.Country.Partner", "Region")))
## Country.of.Origin altitude_mean_meters_new
## 101 Kenya 1
## 1069 Taiwan 50
## In.Country.Partner Region
## 101 Kenya Coffee Traders Association <NA>
## 1069 Blossom Valley International taiwan
These values stand out and upon examination we can see that the values has 1 is from Kenya and since the region is not specified there is no way to verify the actual value or an estimate of the value. For now it would be best to remove the instance.
# Remove the row where altitude_mean_meters_new is 1 and Country.of.Origin is 'Kenya'
coffee_clean <- subset(coffee_clean, !(altitude_mean_meters_new == 1 & Country.of.Origin == 'Kenya'))
Since most of the data is skewed and we have many instances of zeros in the data set we will visualize the data set by undertaking the log(x + 1) transformation. And see how the distribution changes and see if we can gain insights based on the transformation.
coffee_long <- coffee_clean %>%
mutate(across(everything(), as.character)) %>%
pivot_longer(cols = everything(),
names_to = "variable",
values_to = "value") %>%
filter(!is.na(as.numeric(value)))
## Warning: There was 1 warning in `filter()`.
## ℹ In argument: `!is.na(as.numeric(value))`.
## Caused by warning:
## ! NAs introduced by coercion
log_transform <- function(df) {
df %>%
mutate(across(where(is.numeric), ~ log(. + 1)))
}
coffee_histograms_log_transformed <- coffee_long %>%
filter(!is.na(as.numeric(value))) %>%
mutate(value = as.numeric(value)) %>%
log_transform() %>%
ggplot(aes(x = value)) +
geom_histogram(aes(y = ..density..), bins = 20, color = "black", fill = "orange") +
facet_wrap(~variable, scales = "free", ncol = 4) +
labs(title = "Log-transformed Histograms of Numeric Variables",
x = "Value",
y = "Density") +
theme_minimal()
print(coffee_histograms_log_transformed)
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
coffee_density_log_transformed <- coffee_long %>%
filter(!is.na(as.numeric(value))) %>%
mutate(value = as.numeric(value)) %>%
log_transform() %>%
ggplot(aes(x = value)) +
geom_density(fill = "orange", color = "black") +
facet_wrap(~variable, scales = "free", ncol = 4) +
labs(title = "Log-transformed Density Plots of Numeric Variables",
x = "Value",
y = "Density") +
theme_minimal()
print(coffee_density_log_transformed)
coffee_box_plots_log_transformed <- coffee_long %>%
filter(!is.na(as.numeric(value))) %>%
mutate(value = as.numeric(value)) %>%
log_transform() %>%
ggplot(aes(y = value)) +
geom_boxplot(fill = "orange", color = "black") +
facet_wrap(~variable, scales = "free", ncol = 4) +
labs(title = "Log-transformed Box Plots of Numeric Variables",
x = "",
y = "Value") +
theme_minimal()
print(coffee_box_plots_log_transformed)
After conducting the log(x + 1) transformation the distribution for
Category.Two.Defects and Total.Weight has changed. With both showing two
distinct groups within the distributions and not having any outliers in
that there are distinct groups within distributions. The transformation
has not affected Quakers, Clean.Cup, Sweetness and Uniformity. Also
there are outliers within the other values, but we will not treat them
as they are valid values and represent with great quality coffee of bad
quality coffee.
In this section we will explore the relationship of the different variables. We will use correlation plot and scatter plot to visualize and determine what the relationship is with amongst each other.
correlation_plot <- function(data) {
data %>%
# Select numerical columns
select(where(is.numeric)) %>%
# Compute the correlation matrix
cor(use = "complete.obs") %>%
# Convert the correlation matrix to a data frame
as.table() %>%
as.data.frame() %>%
# Create a scatter plot with correlation values
ggplot(aes(Var1, Var2, fill = Freq)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1, 1), space = "Lab",
name = "Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Correlation Plot of Numerical Values",
subtitle = "Color indicates correlation strength",
x = "Numerical Variables",
y = "Numerical Variables") +
geom_text(aes(Var1, Var2, label = sprintf("%.2f", Freq)),
color = "black", vjust = 1.5) +
coord_fixed() %>%
# Print the plot
print()
}
print(correlation_plot(coffee_clean))
## <ggproto object: Class CoordFixed, CoordCartesian, Coord, gg>
## aspect: function
## backtransform_range: function
## clip: on
## default: FALSE
## distance: function
## expand: TRUE
## is_free: function
## is_linear: function
## labels: function
## limits: list
## modify_scales: function
## range: function
## ratio: 1
## render_axis_h: function
## render_axis_v: function
## render_bg: function
## render_fg: function
## setup_data: function
## setup_layout: function
## setup_panel_guides: function
## setup_panel_params: function
## setup_params: function
## train_panel_guides: function
## transform: function
## super: <ggproto object: Class CoordFixed, CoordCartesian, Coord, gg>
Inspecting the correlation plot, it is evident that the values that are used to calculate the Total.Cup.Points such as the Aroma, Flavor, Aftertaste, Acidity, Body, Balance, Uniformity, Clean.Cup, Sweetness and Cupper.Points (subjective parameters) are all highly correlated with each other and with Total.Cup.Points. Whereas the objective parameters such as altitude, moisture the defects and the total weights indicate some correlation but not as strong. We will see if our transformation from the log has any impact on the correlation plot and compare them without the correlation plot without any transformation.
correlation_plot_log <- function(data) {
data %>%
# Select numerical columns
select(where(is.numeric)) %>%
# Apply log(x + 1) transformation
mutate(across(everything(), ~log(.x + 1))) %>%
# Compute the correlation matrix
cor(use = "complete.obs") %>%
# Convert the correlation matrix to a data frame
as.table() %>%
as.data.frame() %>%
# Create a scatter plot with correlation values
ggplot(aes(Var1, Var2, fill = Freq)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1, 1), space = "Lab",
name = "Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Correlation Plot of Numerical Values",
subtitle = "Color indicates correlation strength",
x = "Numerical Variables",
y = "Numerical Variables") +
geom_text(aes(Var1, Var2, label = sprintf("%.2f", Freq)),
color = "black", vjust = 1.5) +
coord_fixed() %>%
# Print the plot
print()
}
print(correlation_plot_log(coffee_clean))
## <ggproto object: Class CoordFixed, CoordCartesian, Coord, gg>
## aspect: function
## backtransform_range: function
## clip: on
## default: FALSE
## distance: function
## expand: TRUE
## is_free: function
## is_linear: function
## labels: function
## limits: list
## modify_scales: function
## range: function
## ratio: 1
## render_axis_h: function
## render_axis_v: function
## render_bg: function
## render_fg: function
## setup_data: function
## setup_layout: function
## setup_panel_guides: function
## setup_panel_params: function
## setup_params: function
## train_panel_guides: function
## transform: function
## super: <ggproto object: Class CoordFixed, CoordCartesian, Coord, gg>
From the transformation, it is evident that the data that are skewed
such as Category.One.Defects, Category.Two.Defects and Total.Weight
variables have improved in correlation with Total.Cup.Points which is
used to derive the variable Grade. In this manner these values are right
skewed and treating them with the log transformation. However on the
other side altitude_mean_meters_new decreases with the log
transformation thus it will not be treated. This is expected as the
variables are right skewed and this will allow us to mitigate the
extreme values.
coffee_clean <- coffee_clean %>%
mutate(
Log.Total.Weight = log(Total.Weight + 1),
Log.Category.One.Defects = log(Category.One.Defects + 1),
Log.Category.Two.Defects = log(Category.Two.Defects + 1))
In the following section we define the objective parameters as variables or features that are not used in the calculation for the Total.Cup.Points and are not based on the Q-graders personal opinions which may involve personal bias or variability among different observers. The objective parameters are values determined through ovservations, measurements or analysis and are measured based on facts.
# Define the relevant columns and color factor
objective_parameters <- c("Total.Cup.Points", "Moisture", "Log.Category.One.Defects", "Quakers", "Log.Category.Two.Defects", "altitude_mean_meters_new", "Log.Total.Weight")
color_factor <- "Grade"
color_levels <- levels(coffee_clean[[color_factor]])
color_scale <- RColorBrewer::brewer.pal(length(color_levels), "Set1")
create_scatter_plot_matrix <- function(data, relevant_columns, color_factor, plot_title) {
# Define color levels and scale
color_levels <- levels(data[[color_factor]])
color_scale <- RColorBrewer::brewer.pal(length(color_levels), "Set1")
# Create scatter plot matrix using ggpairs
data %>%
select(all_of(c(relevant_columns, color_factor))) %>%
mutate(across(all_of(color_factor), as.factor)) %>%
na.omit() %>%
mutate(count = ave(seq(nrow(.)), .[[color_factor]], FUN=length)) %>%
filter(count >= 2) %>%
select(-count) %>%
GGally::ggpairs(
columns = 1:(ncol(.) - 1), # Exclude color factor
upper = list(mapping = aes_string(color = color_factor, fill = color_factor)),
lower = list(mapping = aes_string(color = color_factor, fill = color_factor)),
diag = list(continuous = wrap("densityDiag", color = "orange")), # Exclude distribution of numerical variables
title = plot_title
) +
ggplot2::theme_minimal() +
ggplot2::theme(
text = ggplot2::element_text(size = 12),
panel.grid.major = ggplot2::element_blank(),
panel.grid.minor = ggplot2::element_blank(),
axis.line = ggplot2::element_line(color = "black")
) +
ggplot2::scale_color_manual(values = color_scale, name = paste("Legend: ", color_levels)) +
ggplot2::scale_fill_manual(values = color_scale, name = paste("Legend: ", color_levels)) +
ggplot2::theme(legend.position = "top") -> scatter_plot_matrix
# Print the scatter plot matrix
print(scatter_plot_matrix)
}
create_scatter_plot_matrix(coffee_clean, objective_parameters, color_factor, "Objective Paramters Numerical")
From the scatter plot we can see that there is a weak positive correlation of Total.Cup.Points with altitude_mean_meters_new and Log.Total.Weight. What is surprising is that the correlation between Total.Cup.Points and Quakers has a very weak positive correlation. A more plausible assumption would of been that there would be a negative correlation and that more Quakers would reduce the quality or the grade of the coffee. On the other hand Log.Category.One.Defect, Log.Category.Two.Defect and Moisture are weakly negatively correlated. Form the scatter plot there are no linear relationship that is visible and no clear patterns that we can easily recognize. Let us examine to see if different plots have different plots.
create_scatter_plot_matrix_2 <- function(data, relevant_columns, color_factor, plot_title) {
# Define color levels and scale
color_levels <- levels(data[[color_factor]])
color_scale <- RColorBrewer::brewer.pal(length(color_levels), "Set1")
# Create scatter plot matrix using ggpairs
data %>%
select(all_of(c(relevant_columns, color_factor))) %>%
mutate(across(all_of(color_factor), as.factor)) %>%
na.omit() %>%
mutate(count = ave(seq(nrow(.)), .[[color_factor]], FUN=length)) %>%
filter(count >= 2) %>%
select(-count) %>%
GGally::ggpairs(
columns = 1:(ncol(.) - 1), # Exclude color factor
upper = list(continuous = wrap("smooth", method = "lm", se = FALSE, color = "orange", size = 1),
mapping = aes_string(color = color_factor, fill = color_factor)),
diag = list(continuous = wrap("densityDiag", color = "orange")),
lower = list(mapping = aes_string(color = color_factor, fill = color_factor)),
title = plot_title
) +
ggplot2::theme_minimal() +
ggplot2::theme(
text = ggplot2::element_text(size = 12),
panel.grid.major = ggplot2::element_blank(),
panel.grid.minor = ggplot2::element_blank(),
axis.line = ggplot2::element_line(color = "black")
) +
ggplot2::scale_color_manual(values = color_scale, name = paste("Legend: ", color_levels)) +
ggplot2::scale_fill_manual(values = color_scale, name = paste("Legend: ", color_levels)) +
ggplot2::theme(legend.position = "top") -> scatter_plot_matrix
# Print the scatter plot matrix
print(scatter_plot_matrix)
}
create_scatter_plot_matrix_2(coffee_clean, objective_parameters, color_factor, "Objective Parameters Numerical and Linear Relationships of Grades")
Adding the lines to the scatter plot of the different grades of coffee we can see how each factor level behaves and what ranges of values it takes on. For instance, for the moisture predictor, we can see that excellent grades values start from zero but do not extend to the entire range of values. Where as the coffee grade very good and commodity extend to the entire range of values. Similarly for quakers the commodity grade has a weak positive correlation but doesn’t extend to the full range of values that are available in the graph. The excellent grade coffee has a negative correlation and also doesn’t extend to the entire range of the variable. Whereas the altitude has the excellent coffee grade starts a little higher values and extends to the maximum value of the variable. On the other hand commodity start off little higher values but doesn’t extend to the full values of the predictor.
The section above investigated the objective parameters in this section we see the values that are used to calculate the Total.Cup.Points and how each of the different variable interact with each other. Since these values are based on the Q-Graders personal preferences and are used to calculate the Total.Cup.Points we would expect to see a strong linear relationship with each other.
subjective_parameters <-c("Total.Cup.Points", "Aroma", "Flavor", "Aftertaste", "Acidity", "Body", "Balance", "Uniformity", "Clean.Cup", "Sweetness", "Cupper.Points")
create_scatter_plot_matrix(coffee_clean, subjective_parameters, color_factor, "Subjective Parameters Numerical")
As oppose to the objective parameters we can see that there is a clear linear relationship with the subjective parameters, especially for the Aroma, Flavor, Aftertaste, Body, Balance and Cupper.Points. The plot also show us that these variables are correlated amongst each other. Even though we cannot distinctly easily visualize that the positive correlation for Uniformity. Clean.Cup and Sweetness the correlation indicated that there is positive correlation with Total.Cup.Points but not as strong as the ones formerly stated. These variables are also not as correlated with the other variables and with each other as the ones stated above. To see how the different grades range within the variables, we will use similar approach as the objective parameters and draw the corresponding linear relationship of each of the variable.
create_scatter_plot_matrix_2(coffee_clean, subjective_parameters, color_factor, "Subjective Parameters Numerical with Linear Relationships of Grades")